For this assignment we will apply some of the approaches presented in ISLR for variable selection and model regularization to some of the datasets that we have worked with previously. The goal will be to see whether some of the more principled methods for model selection allow us to better understand relative variable importance, the variability of the models' predictive performance, etc.
For the purposes of this preface we will use the abalone dataset to illustrate some of the concepts and approaches. The problems in the assignment will use the computer hardware dataset from last week's assignment. The flow below closely follows the outline of Labs 6.5 and 6.6 in ISLR and you are encouraged to refer to them for additional examples and details.
Assuming that we have read and pre-processed the abalone data (converted rings to age, log-transformed, removed height outliers – the two zeroes and the two largest values), let's use regsubsets from library leaps to select optimal models with the number of terms ranging from one to all variables in the dataset, using each of the methods available for this function, and collect the corresponding model metrics (please notice that we override the default value of the nvmax argument, and reflect on why we do that):
summaryMetrics <- NULL
whichAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward", "seqrep") ) {
  rsRes <- regsubsets(age~.,lnAbaDat,method=myMthd,nvmax=9)
  summRes <- summary(rsRes)
  whichAll[[myMthd]] <- summRes$which
  for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
    summaryMetrics <- rbind(summaryMetrics,
      data.frame(method=myMthd,metric=metricName,
        nvars=1:length(summRes[[metricName]]),
        value=summRes[[metricName]]))
  }
}
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") + theme(legend.position="top")
We can see that, except for sequential replacement, which has chosen a noticeably inferior model with four variables, all methods came up with models of very comparable performance by every associated metric. Plotting variable membership for each of those models, as captured by the which attribute of the summary, further illustrates that the variables chosen by sequential replacement for the four-variable model were sex and the highly correlated length and diameter – explaining its poor performance, but not why this algorithm chose it:
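The length/diameter collinearity behind this behavior can be checked directly with cor. A minimal sketch using simulated near-collinear predictors as a stand-in for the abalone columns (with the real data one would simply call cor on the corresponding columns of lnAbaDat):

```r
# Simulated stand-in for two nearly collinear predictors such as
# abalone length and diameter (hypothetical values, not the real data):
set.seed(1)
n <- 200
len  <- rnorm(n)
diam <- len + rnorm(n, sd = 0.05)  # almost a copy of len
round(cor(len, diam), 3)           # correlation very close to 1
```

Two predictors this strongly correlated carry nearly the same information, so a greedy search can swap one for the other with little change in RSS.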
old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in names(whichAll) ) {
  image(1:nrow(whichAll[[myMthd]]),
    1:ncol(whichAll[[myMthd]]),
    whichAll[[myMthd]],xlab="N(vars)",ylab="",
    xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
    col=c("white","gray"),main=myMthd)
  axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
  axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}
par(old.par)
Next, following Lab 6.5.3 in ISLR, we will split our data into training and test sets, select the best subset of variables on the training data, evaluate its performance on training and test data, and record which variables have been selected each time. First, to be able to use regsubsets output to make predictions, we follow ISLR and set up a predict function that can be applied to the output from regsubsets (notice .regsubsets in its name – this is how, under the S3 OOP framework in R, methods are matched to corresponding classes; further down we will call it by simply passing regsubsets output to predict, which works because the function regsubsets returns an object of class regsubsets):
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
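To make the S3 dispatch explicit, here is a toy illustration (the class name toy and its slope field are made up for the example): defining predict.&lt;class&gt; is all it takes for a plain predict call to reach the method.

```r
# predict is a generic: predict(fit, ...) dispatches on class(fit).
predict.toy <- function(object, newdata, ...) newdata * object$slope
fit <- structure(list(slope = 2), class = "toy")
predict(fit, 1:3)  # dispatches to predict.toy, returns 2 4 6
```

The same mechanism routes predict(rsTrain, ...) below to predict.regsubsets, because rsTrain has class regsubsets.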
We are all set now to draw training sets, choose the best set of variables for them by each of the four methods available in regsubsets, calculate the test error, etc. To summarize variable selection over multiple splits of the data into training and test, we will use the 3-dimensional array whichSum, with the third dimension corresponding to the four methods available in regsubsets. To split the data into training and test we will again use the sample function – those who are curious and paying attention may want to reflect on the difference between how it is done below and how it is implemented in Ch. 6.5.3 of ISLR, and on the consequences of that difference. (Hint: consider how the sizes of the training and test datasets vary from one iteration to another in these two implementations.)
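The difference hinted at can be seen directly at the prompt – a small sketch with a made-up n of 100: permuting a fixed half-and-half logical vector gives an exactly even split every time, while independent TRUE/FALSE draws (the ISLR style) give a training-set size that fluctuates around n/2.

```r
set.seed(2)
n <- 100
# permuting a fixed half-and-half vector: exactly n/2 TRUEs every time
sum(sample(rep(c(TRUE, FALSE), length.out = n)))          # always 50
# independent coin flips (as in ISLR): the count of TRUEs varies
replicate(5, sum(sample(c(TRUE, FALSE), n, replace = TRUE)))
```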
dfTmp <- NULL
whichSum <- array(0,dim=c(9,10,4),
  dimnames=list(NULL,colnames(model.matrix(age~.,lnAbaDat)),
    c("exhaustive", "backward", "forward", "seqrep")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(lnAbaDat)))
  # Try each method available in regsubsets
  # to select best model of each size:
  for ( jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
    rsTrain <- regsubsets(age~.,lnAbaDat[bTrain,],nvmax=9,method=jSelect)
    # Add up variable selections:
    whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
    # Calculate test error for each set of variables
    # using predict.regsubsets implemented above:
    for ( kVarSet in 1:9 ) {
      # make predictions:
      testPred <- predict(rsTrain,lnAbaDat[!bTrain,],id=kVarSet)
      # calculate MSE:
      mseTest <- mean((testPred-lnAbaDat[!bTrain,"age"])^2)
      # add to data.frame for future plotting:
      dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
        mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),
        trainTest=c("test","train")))
    }
  }
}
# plot MSEs by training/test, number of
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)
We can see that the training error decreases monotonically with model size, while the test error stops improving noticeably beyond models of about three to four variables, with all four selection methods performing comparably.
This is further supported by plotting the average fraction of times each variable is included in the best model of every size by each of the four methods (darker shades of gray indicate that the fraction of times a given variable has been included in the best subset is closer to unity):
old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in dimnames(whichSum)[[3]] ) {
  tmpWhich <- whichSum[,,myMthd] / nTries
  image(1:nrow(tmpWhich),1:ncol(tmpWhich),tmpWhich,
    xlab="N(vars)",ylab="",xaxt="n",yaxt="n",main=myMthd,
    breaks=c(-0.1,0.1,0.25,0.5,0.75,0.9,1.1),
    col=c("white","gray90","gray75","gray50","gray25","gray10"))
  axis(1,1:nrow(tmpWhich),rownames(tmpWhich))
  axis(2,1:ncol(tmpWhich),colnames(tmpWhich),las=2)
}
par(old.par)
For best subsets of about four or more variables, variable inclusion starts to vary more across different selections of training and test sets.
Similar observations can be made using cross-validation rather than a split of the dataset into training and test; this is omitted here for brevity.
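For reference, a cross-validation version can be sketched along the same lines. The snippet below is a sketch on synthetic data standing in for lnAbaDat (variable names and sizes are made up for the example) and assumes library leaps is available; it reuses the column-alignment trick from predict.regsubsets above.

```r
library(leaps)

# K-fold CV of best-subset selection on synthetic stand-in data:
set.seed(3)
n <- 300
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
dat <- data.frame(X, y = X %*% c(1, 0.5, 0, 0, 0) + rnorm(n))
K <- 5
fold <- sample(rep(1:K, length.out = n))
cvMSE <- matrix(NA, K, 5)
for (k in 1:K) {
  rs <- regsubsets(y ~ ., dat[fold != k, ], nvmax = 5)
  for (m in 1:5) {
    # align held-out model matrix columns with the selected coefficients:
    cf <- coef(rs, id = m)
    pred <- model.matrix(y ~ ., dat[fold == k, ])[, names(cf)] %*% cf
    cvMSE[k, m] <- mean((pred - dat$y[fold == k])^2)
  }
}
colMeans(cvMSE)  # average CV MSE for each model size
```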
As explained in the lecture and in ISLR Ch. 6.6, lasso and ridge regression can be performed by the glmnet function from library glmnet – its argument alpha governs the form of the shrinkage penalty, so that alpha=0 corresponds to ridge and alpha=1 to lasso regression. The arguments to glmnet differ from those used, for example, for lm, and require that the matrix of predictors and the outcome be specified separately. model.matrix is particularly helpful for setting up the matrix of predictors, as it creates dummy variables for categorical predictors:
# -1 to get rid of the intercept column -- glmnet includes an intercept on its own:
x <- model.matrix(age~.,lnAbaDat)[,-1]
head(lnAbaDat)
## sex len diam h ww sw vw
## 1 M -0.7874579 -1.0078579 -2.353878 -0.6655320 -1.493880 -2.292635
## 2 M -1.0498221 -1.3280255 -2.407946 -1.4894351 -2.307598 -3.026191
## 3 F -0.6348783 -0.8675006 -2.002481 -0.3900840 -1.360627 -1.955456
## 4 M -0.8209806 -1.0078579 -2.079442 -0.6616485 -1.534794 -2.171557
## 5 I -1.1086626 -1.3664917 -2.525729 -1.5847453 -2.413517 -3.231455
## 6 I -0.8556661 -1.2039728 -2.353878 -1.0455456 -1.958995 -2.557477
## sh age
## 1 -1.897120 2.803360
## 2 -2.659260 2.140066
## 3 -1.560648 2.351375
## 4 -1.864330 2.442347
## 5 -2.900422 2.140066
## 6 -2.120264 2.251292
# notice how it created two dummy columns for sex (the first level is absorbed into the intercept):
head(x)
## sexI sexM len diam h ww sw vw
## 1 0 1 -0.7874579 -1.0078579 -2.353878 -0.6655320 -1.493880 -2.292635
## 2 0 1 -1.0498221 -1.3280255 -2.407946 -1.4894351 -2.307598 -3.026191
## 3 0 0 -0.6348783 -0.8675006 -2.002481 -0.3900840 -1.360627 -1.955456
## 4 0 1 -0.8209806 -1.0078579 -2.079442 -0.6616485 -1.534794 -2.171557
## 5 1 0 -1.1086626 -1.3664917 -2.525729 -1.5847453 -2.413517 -3.231455
## 6 1 0 -0.8556661 -1.2039728 -2.353878 -1.0455456 -1.958995 -2.557477
## sh
## 1 -1.897120
## 2 -2.659260
## 3 -1.560648
## 4 -1.864330
## 5 -2.900422
## 6 -2.120264
y <- lnAbaDat[,"age"]
ridgeRes <- glmnet(x,y,alpha=0)
plot(ridgeRes)
Plotting the output of glmnet illustrates the change in the contributions of each of the predictors as the amount of shrinkage changes. In ridge regression each predictor contributes more or less over the entire range of shrinkage levels.
The output of cv.glmnet shows the averages and variabilities of MSE in cross-validation across different levels of regularization. The lambda.min field indicates the value of \(\lambda\) at which the lowest average MSE has been achieved; lambda.1se shows a larger \(\lambda\) (more regularization) whose MSE is one standard error (of cross-validation) above the minimum – an often recommended \(\lambda\) to use, under the rationale that it will be less susceptible to overfitting. You may find it instructive to experiment with levels of lambda other than those used by default to understand the sensitivity of the cv.glmnet output to them. predict, depending on the value of its type argument, provides access to model predictions, coefficients, etc. at a given level of lambda:
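The 1SE rule can be reproduced by hand from the cvm (mean CV error), cvsd (its standard error) and lambda fields of the cv.glmnet output: lambda.1se is the largest \(\lambda\) whose CV error stays within one standard error of the minimum. A sketch on synthetic data, assuming library glmnet is available:

```r
library(glmnet)

set.seed(4)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- drop(x %*% c(1, 0.5, 0, 0, 0)) + rnorm(100)
cvr <- cv.glmnet(x, y, alpha = 0)
iMin <- which.min(cvr$cvm)
# largest lambda with CV error within one SE of the minimum:
lambda1seByHand <- max(cvr$lambda[cvr$cvm <= cvr$cvm[iMin] + cvr$cvsd[iMin]])
all.equal(lambda1seByHand, cvr$lambda.1se)
```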
cvRidgeRes <- cv.glmnet(x,y,alpha=0)
plot(cvRidgeRes)
cvRidgeRes$lambda.min
## [1] 0.02174055
cvRidgeRes$lambda.1se
## [1] 0.03154181
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3.070672846
## sexI -0.072807198
## sexM -0.003192705
## len 0.004492528
## diam 0.096242950
## h 0.191812264
## ww 0.038262477
## sw -0.111369312
## vw 0.007970718
## sh 0.166256142
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.1se)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3.046842678
## sexI -0.072535542
## sexM -0.002349186
## len 0.022071413
## diam 0.091557840
## h 0.180936128
## ww 0.033394020
## sw -0.081250713
## vw 0.011349539
## sh 0.134889314
# and with lambda's other than default:
cvRidgeRes <- cv.glmnet(x,y,alpha=0,lambda=10^((-80:80)/20))
plot(cvRidgeRes)
Lasso regression is performed by the same call to glmnet, except that now alpha=1. One can see how more coefficients become zero with an increasing amount of shrinkage. Notice that the amount of regularization increases from right to left when plotting the output of glmnet, and from left to right when plotting the output of cv.glmnet.
lassoRes <- glmnet(x,y,alpha=1)
plot(lassoRes)
cvLassoRes <- cv.glmnet(x,y,alpha=1)
plot(cvLassoRes)
# With other than default levels of lambda:
cvLassoRes <- cv.glmnet(x,y,alpha=1,lambda=10^((-120:0)/20))
plot(cvLassoRes)
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.1se)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.93334419
## sexI -0.05455864
## sexM .
## len .
## diam .
## h 0.11615439
## ww 0.12718865
## sw -0.34305830
## vw .
## sh 0.40006512
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.min)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.57877179
## sexI -0.04757667
## sexM .
## len -0.25776906
## diam 0.10729800
## h 0.11033181
## ww 0.43813688
## sw -0.47798765
## vw -0.06189511
## sh 0.33642479
As explained above and illustrated in the plots of the cv.glmnet output, lambda.1se typically corresponds to more shrinkage, with more coefficients set to zero by lasso.
Lastly, we can run lasso on several training datasets and calculate the corresponding test MSE and the frequency of inclusion of each of the coefficients in the model:
lassoCoefCnt <- 0
lassoMSE <- NULL
for ( iTry in 1:30 ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=dim(x)[1]))
  cvLassoTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
  lassoTrain <- glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
  lassoTrainCoef <- predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
  lassoCoefCnt <- lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
  lassoTestPred <- predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
  lassoMSE <- c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 0.02899477
lassoCoefCnt
## sexI sexM len diam h ww sw vw sh
## 30 6 3 1 30 10 30 2 30
One can conclude that a typical lasso model includes about four coefficients and (by comparison with some of the plots above) that its test MSE is about what was observed for the three-to-four-variable models chosen by the best subset selection approach.
Using the computer hardware dataset from assignment 4 (properly preprocessed: shifted/log-transformed, with ERP and model/vendor names excluded), select the best subsets of variables for predicting PRP by some of the methods available in regsubsets. Plot the corresponding model metrics (rsq, rss, etc.) and discuss the results presented in these plots (e.g. what number of variables appears to be optimal by different metrics) and which variables are included in models of which sizes (e.g. are there variables that are included more often than others?).
cpuDatFull <- read.table("machine.data",sep=",")
colnames(cpuDatFull) <- c("vendor","model","myct","mmin","mmax","cach","chmin","chmax","prp","erp")
summary(cpuDatFull)
## vendor model myct mmin
## ibm : 32 100 : 1 Min. : 17.0 Min. : 64
## nas : 19 1100/61-h1: 1 1st Qu.: 50.0 1st Qu.: 768
## honeywell: 13 1100/81 : 1 Median : 110.0 Median : 2000
## ncr : 13 1100/82 : 1 Mean : 203.8 Mean : 2868
## sperry : 13 1100/83 : 1 3rd Qu.: 225.0 3rd Qu.: 4000
## siemens : 12 1100/84 : 1 Max. :1500.0 Max. :32000
## (Other) :107 (Other) :203
## mmax cach chmin chmax
## Min. : 64 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 4000 1st Qu.: 0.00 1st Qu.: 1.000 1st Qu.: 5.00
## Median : 8000 Median : 8.00 Median : 2.000 Median : 8.00
## Mean :11796 Mean : 25.21 Mean : 4.699 Mean : 18.27
## 3rd Qu.:16000 3rd Qu.: 32.00 3rd Qu.: 6.000 3rd Qu.: 24.00
## Max. :64000 Max. :256.00 Max. :52.000 Max. :176.00
##
## prp erp
## Min. : 6.0 Min. : 15.00
## 1st Qu.: 27.0 1st Qu.: 28.00
## Median : 50.0 Median : 45.00
## Mean : 105.6 Mean : 99.33
## 3rd Qu.: 113.0 3rd Qu.: 101.00
## Max. :1150.0 Max. :1238.00
##
pairs(cpuDatFull[,-(1:2)]+1,log="xy")
cpuDat <- cpuDatFull[,c("myct","mmin","mmax","cach","chmin","chmax","prp")]
cpuDat <- log(cpuDat+1)
summaryMetrics <- NULL
whichAll <- list()
regsubsetsAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward", "seqrep") ) {
  rsRes <- regsubsets(prp~.,cpuDat,method=myMthd,nvmax=6)
  regsubsetsAll[[myMthd]] <- rsRes
  summRes <- summary(rsRes)
  whichAll[[myMthd]] <- summRes$which
  for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
    summaryMetrics <- rbind(summaryMetrics,
      data.frame(method=myMthd,metric=metricName,
        nvars=1:length(summRes[[metricName]]),
        value=summRes[[metricName]]))
  }
}
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") + theme(legend.position="top")
All four variable selection methods, when applied to the entire dataset, yield models with very similar fit metrics. For all of them except BIC, increasing the number of variables appears to result in progressive improvement of the fit; BIC reaches its minimum when five out of six variables are in the model.
for ( myMthd in names(regsubsetsAll) ) {
  plot(regsubsetsAll[[myMthd]],main=myMthd)
}
The default plot, when called on regsubsets output (using the S3 convention to actually call the function plot.regsubsets), shows variable membership in each model, sorted by the chosen model selection statistic (BIC by default), and colors it by the level of that statistic. By eye it looks like, in this case, all four variable selection methods choose the same variables for a given model size when applied to the entire computer hardware dataset.
The same conclusion can be reached by simply visualizing variable membership in the models in order of their size:
old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in names(whichAll) ) {
  image(1:nrow(whichAll[[myMthd]]),
    1:ncol(whichAll[[myMthd]]),
    whichAll[[myMthd]],xlab="N(vars)",ylab="",
    xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
    col=c("white","gray"),main=myMthd)
  axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
  axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}
par(old.par)
Splitting the computer hardware dataset into training and test as shown above, please calculate and plot the training and test errors (MSE) for each model size for several of the methods available for regsubsets. Using the which field, investigate the stability of variable selection at each model size across multiple selections of training/test data. Discuss these results – e.g. what model size appears to be most useful by this approach, what is the error rate corresponding to it, how stable is this conclusion across multiple methods for best subset selection, and how does this error compare to that of ERP (the PRP estimate by the dataset authors)?
For an extra ten points, do the same using cross-validation or the bootstrap.
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
# define a function that we will reuse for bootstrap:
resampleMSEregsubsetsCPUdat <- function(inpMthd,nTries=100) {
  if ( ! inpMthd %in% c('traintest','bootstrap') ) {
    stop("unexpected resampling method!")
  }
  dfTmp <- NULL
  whichSum <- array(0,dim=c(ncol(cpuDat)-1,ncol(cpuDat),4),
    dimnames=list(NULL,colnames(model.matrix(prp~.,cpuDat)),
      c("exhaustive", "backward", "forward", "seqrep")))
  for ( iTry in 1:nTries ) {
    trainIdx <- NULL
    if ( inpMthd == "traintest" ) {
      trainIdx <- sample(nrow(cpuDat),nrow(cpuDat)/2)
    } else if ( inpMthd == "bootstrap" ) {
      trainIdx <- sample(nrow(cpuDat),nrow(cpuDat),replace=TRUE)
    }
    for ( jSelect in c("exhaustive", "backward", "forward", "seqrep") ) {
      rsTrain <- regsubsets(prp~.,cpuDat[trainIdx,],nvmax=ncol(cpuDat)-1,method=jSelect)
      whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
      # notice that 1:n-1 and 1:(n-1) is not the same -- is it apparent why?
      for ( kVarSet in 1:(ncol(cpuDat)-1) ) {
        # "call" in predict.regsubsets doesn't work here:
        kCoef <- coef(rsTrain,id=kVarSet)
        testPred <- model.matrix(prp~.,cpuDat[-trainIdx,])[,names(kCoef)] %*% kCoef
        mseTest <- mean((testPred-cpuDat[-trainIdx,"prp"])^2)
        dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
          mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/length(trainIdx)),
          trainTest=c("test","train")))
      }
    }
  }
  list(mseAll=dfTmp,whichSum=whichSum,nTries=nTries)
}
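The operator-precedence point flagged in the comment inside the function above (1:n-1 versus 1:(n-1)) can be seen immediately at the prompt:

```r
n <- 5
1:n-1    # ":" binds tighter than "-": this is (1:n) - 1, i.e. 0 1 2 3 4
1:(n-1)  # the intended sequence 1 2 3 4
```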
Resample by splitting the dataset into training and test:
cpuTrainTestRes <- resampleMSEregsubsetsCPUdat("traintest",30)
Plot resulting training and test MSE:
ggplot(cpuTrainTestRes$mseAll,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)+geom_hline(yintercept = mean((log(cpuDatFull[,"erp"]+1)-log(cpuDatFull[,"prp"]+1))^2),linetype=2)
Test error noticeably improves with increasing model size, up to about 4 variables – e.g. the median test MSE of the larger model is lower than, or comparable to, the lower quartile of MSE for the smaller model. Going from 4 to 5 variables perhaps also decreases test MSE on average, although that decrease is small compared to the variability observed across resampling tries. The test MSEs for models with 5 and 6 variables are very comparable.
old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in dimnames(cpuTrainTestRes$whichSum)[[3]] ) {
  tmpWhich <- cpuTrainTestRes$whichSum[,,myMthd] / cpuTrainTestRes$nTries
  image(1:nrow(tmpWhich),1:ncol(tmpWhich),tmpWhich,
    xlab="N(vars)",ylab="",xaxt="n",yaxt="n",main=myMthd,
    breaks=c(-0.1,0.1,0.25,0.5,0.75,0.9,1.1),
    # notice parameterized creation of the gray scale colors:
    col=gray(seq(1,0,length=6)))
  axis(1,1:nrow(tmpWhich),rownames(tmpWhich))
  axis(2,1:ncol(tmpWhich),colnames(tmpWhich),las=2)
}
par(old.par)
Plots of average variable membership in the model suggest that mmax and cach are included most consistently across model sizes, while myct is the variable included least often.
Similar results are obtained using the bootstrap:
cpuBootRes <- resampleMSEregsubsetsCPUdat("bootstrap",30)
ggplot(cpuBootRes$mseAll,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)+geom_hline(yintercept = mean((log(cpuDatFull[,"erp"]+1)-log(cpuDatFull[,"prp"]+1))^2),linetype=2)
old.par <- par(mfrow=c(2,2),ps=16,mar=c(5,7,2,1))
for ( myMthd in dimnames(cpuBootRes$whichSum)[[3]] ) {
  tmpWhich <- cpuBootRes$whichSum[,,myMthd] / cpuBootRes$nTries
  image(1:nrow(tmpWhich),1:ncol(tmpWhich),tmpWhich,
    xlab="N(vars)",ylab="",xaxt="n",yaxt="n",main=myMthd,
    breaks=c(-0.1,0.1,0.25,0.5,0.75,0.9,1.1),
    # notice parameterized creation of the gray scale colors:
    col=gray(seq(1,0,length=6)))
  axis(1,1:nrow(tmpWhich),rownames(tmpWhich))
  axis(2,1:ncol(tmpWhich),colnames(tmpWhich),las=2)
}
par(old.par)
Fit a ridge regression model of PRP in the computer hardware dataset. Plot the outcomes of the glmnet and cv.glmnet calls and discuss the results. Compare the coefficient values at the cross-validation minimum MSE and at 1SE away from it. Experiment with different ranges of lambda passed to cv.glmnet and discuss the results.
For an extra ten points, estimate the test error (MSE) for a ridge model fit on a training dataset, using any resampling strategy of your choice.
x <- model.matrix(prp~.,cpuDat)[,-1]
y <- cpuDat[,"prp"]
ridgeRes <- glmnet(x,y,alpha=0)
plot(ridgeRes)
cvRidgeRes <- cv.glmnet(x,y,alpha=0)
plot(cvRidgeRes)
cvRidgeRes$lambda.min
## [1] 0.08941149
cvRidgeRes$lambda.1se
## [1] 0.6307804
With default \(\lambda\)'s the lowest MSE is attained for the least regularized model (i.e. for the lowest \(\lambda\)).
cvRidgeRes <- cv.glmnet(x,y,alpha=0,lambda=10^((-50:60)/20))
plot(cvRidgeRes)
cvRidgeRes$lambda.min
## [1] 0.05623413
cvRidgeRes$lambda.1se
## [1] 0.4466836
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.min)
## 7 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.70280557
## myct -0.02710888
## mmin 0.18657592
## mmax 0.29115440
## cach 0.17257409
## chmin 0.18749722
## chmax 0.12986281
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.1se)
## 7 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.35145616
## myct -0.08791628
## mmin 0.16734118
## mmax 0.23248074
## cach 0.13759233
## chmin 0.17211218
## chmax 0.12727543
As expected, for the more regularized model (using the 1SE rule) the coefficients are smaller in absolute value than those at the minimum of MSE.
ridgeResScaled <- glmnet(scale(x),y,alpha=0)
cvRidgeResScaled <- cv.glmnet(scale(x),y,alpha=0,lambda=10^((-50:60)/20))
predict(ridgeResScaled,type="coefficients",s=cvRidgeResScaled$lambda.1se)
## 7 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 4.06438372
## myct -0.09399769
## mmin 0.18205915
## mmax 0.23400323
## cach 0.22909623
## chmin 0.13744901
## chmax 0.13100776
Scaling the inputs makes the higher impact of MMAX and CACH more apparent.
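It is worth noting that glmnet standardizes the predictors internally by default (standardize=TRUE) but reports coefficients back on the original scale; pre-scaling x as above simply makes the reported coefficients directly comparable across predictors. A sketch on synthetic data with deliberately mismatched scales (made-up variables, assuming library glmnet is available):

```r
library(glmnet)

set.seed(5)
x <- matrix(rnorm(100 * 3), 100, 3) %*% diag(c(1, 10, 100))  # very different scales
colnames(x) <- paste0("v", 1:3)
y <- drop(x %*% c(1, 0.1, 0.01)) + rnorm(100)  # equal contribution per SD of each predictor
# coefficients on the original scale differ by orders of magnitude...
coef(glmnet(x, y, alpha = 0), s = 0.1)
# ...while on scaled inputs their comparable impacts become apparent:
coef(glmnet(scale(x), y, alpha = 0), s = 0.1)
```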
ridgeCoefCnt <- 0
ridgeCoefAve <- 0
ridgeMSE <- NULL
for ( iTry in 1:30 ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=dim(x)[1]))
  cvridgeTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=0,lambda=10^((-50:50)/20))
  ridgeTrain <- glmnet(x[bTrain,],y[bTrain],alpha=0,lambda=10^((-50:50)/20))
  ridgeTrainCoef <- predict(ridgeTrain,type="coefficients",s=cvridgeTrain$lambda.1se)
  ridgeCoefCnt <- ridgeCoefCnt + (ridgeTrainCoef[-1,1]!=0)
  ridgeCoefAve <- ridgeCoefAve + ridgeTrainCoef[-1,1]
  ridgeTestPred <- predict(ridgeTrain,newx=x[!bTrain,],s=cvridgeTrain$lambda.1se)
  ridgeMSE <- c(ridgeMSE,mean((ridgeTestPred-y[!bTrain])^2))
}
ridgeCoefAve <- ridgeCoefAve / length(ridgeMSE)
ridgeCoefAve
## myct mmin mmax cach chmin chmax
## -0.09729946 0.15798345 0.20865975 0.12269367 0.16687846 0.12758093
mean(ridgeMSE)
## [1] 0.2167895
quantile(ridgeMSE)
## 0% 25% 50% 75% 100%
## 0.1716352 0.2020603 0.2145614 0.2364838 0.2671420
On average, the coefficients of the fits on the training data are roughly comparable to those obtained on the entire dataset, and the test MSE is approximately comparable to that observed for the three-variable models from regsubsets.
Fit a lasso regression model of PRP in the computer hardware dataset. Plot and discuss the glmnet and cv.glmnet results. Compare the coefficient values at the cross-validation minimum MSE and at 1SE away from it – which coefficients are set to zero? Experiment with different ranges of lambda passed to cv.glmnet and discuss the results.
lassoRes <- glmnet(x,y,alpha=1)
plot(lassoRes)
With default \(\lambda\)'s, one of the six variables does not enter the model.
cvLassoRes <- cv.glmnet(x,y,alpha=1)
plot(cvLassoRes)
cvLassoRes <- cv.glmnet(x,y,alpha=1,lambda=10^((-200:20)/80))
plot(cvLassoRes)
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.min)
## 7 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.0558167
## myct .
## mmin 0.1885645
## mmax 0.3124879
## cach 0.1840981
## chmin 0.1930260
## chmax 0.1246982
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.1se)
## 7 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.39423987
## myct .
## mmin 0.15829729
## mmax 0.28557194
## cach 0.16120756
## chmin 0.15818759
## chmax 0.08155133
Similarly to what was seen above, the optimal (in the 1SE sense) model by lasso includes five of the six variables, leaving out MYCT.
lassoResScaled <- glmnet(scale(x),y,alpha=1)
cvLassoResScaled <- cv.glmnet(scale(x),y,alpha=1,lambda=10^((-200:20)/80))
predict(lassoResScaled,type="coefficients",s=cvLassoResScaled$lambda.1se)
## 7 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 4.06438372
## myct .
## mmin 0.17651433
## mmax 0.29623381
## cach 0.27752644
## chmin 0.12908398
## chmax 0.08713005
Similarly to ridge, the use of scaled inputs makes the contributions of MMAX and CACH more pronounced. Notice that these are also the attributes most frequently included in the 2-3 variable models by regsubsets.
Similarly to the example shown in the Preface above, use resampling to estimate the test error of lasso models fit to training data and the stability of variable selection by lasso across different splits of the data into training and test. Use a resampling approach of your choice. Compare the typical model size to that obtained by best subset selection above. Compare the test error observed here to that of ERP and PRP – discuss the result.
lassoCoefCnt <- 0
lassoMSE <- NULL
for ( iTry in 1:30 ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=dim(x)[1]))
  cvLassoTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
  lassoTrain <- glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
  lassoTrainCoef <- predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
  lassoCoefCnt <- lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
  lassoTestPred <- predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
  lassoMSE <- c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 0.2186312
quantile(lassoMSE)
## 0% 25% 50% 75% 100%
## 0.1488507 0.2000857 0.2088421 0.2264319 0.3225358
lassoCoefCnt
## myct mmin mmax cach chmin chmax
## 9 30 30 30 29 28
When fit to random subsets of the data, optimal (in the 1SE sense) lasso models typically include five variables, usually leaving out MYCT. Their test MSE (median of 0.209) is higher than that of the predictions (ERP) obtained by the dataset authors (0.154). On average, the test MSE for lasso models is roughly comparable to that for ridge.